1 About the Project

1.1 Crowdfunding

Crowdfunding is an alternative financing approach that raises funds for new business ventures through small amounts of capital from a large number of individuals. It is a relatively new phenomenon enabled by wide access to social media and internet-based financial technology (fintech) services. Compared to traditional banking and lending services, it makes funding more accessible for entrepreneurs and small businesses.

Little academic research has been conducted on crowdfunding, and there are many interesting areas for investigation. From a financial perspective, it is disrupting the small- and medium-sized enterprise (SME) lending market. Economically, it may be changing the prevalence and makeup of SMEs. In terms of marketing, it gives consumers a greater say in the products they would like to see available, but also exposes them to increased risk. Regarding information and technology, it is enabling innovations on a public platform.

Our project would entail exploration of datasets regarding Indiegogo and Kickstarter projects. The primary project goal would be to construct a model that predicts crowdfunding success. To accomplish this, additional data sources may be required regarding consumer demand, small businesses, etc. From that predictive model, we seek to make recommendations (1) to entrepreneurs, regarding when and how to employ crowdfunding for project financing, and (2) to the lending services and venture capital industries, regarding how their business models should react.

From this independent study course, I expect to:
* Use multiple publicly available crowdfunding datasets and R programming,
* Clean data and conduct primary research to obtain additional variables (as needed),
* Apply statistical analysis techniques to describe trends,
* Construct a model to predict crowdfunding success, and
* Prescribe best practices for entrepreneurs to leverage crowdfunding.

1.2 Data Collection and Cleaning

Raw data were obtained from Kickstarter using a custom Python web scraping function. These data included 50,596 projects and 120 variables pertaining to the projects and their creators from the launch of Kickstarter in 2009 through December 2013.

Possible outcome states for these projects were failed, successful, suspended, canceled, and purged. We excluded 1,246 projects with suspended, canceled, or purged outcome states. We excluded four additional projects whose funding state was inconsistent with the amount pledged (e.g., the state was listed as failed when the amount pledged exceeded the goal). The final dataset for analysis contained 49,350 projects.
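The exclusion rules above can be sketched in base R. This is a minimal illustration on a toy data frame, not the original cleaning script; the column names state, goal, and pledged are assumptions, as is treating a project as funded when pledged is at least goal.

```r
# Toy stand-in for the scraped data; column names are assumptions
projects <- data.frame(
  state   = c("successful", "failed", "canceled", "suspended", "failed", "purged"),
  goal    = c(1000, 5000, 2000, 300, 1000, 800),
  pledged = c(1500, 1200, 0, 50, 1100, 0),
  stringsAsFactors = FALSE
)

# Keep only projects with a definitive outcome
kept <- projects[projects$state %in% c("failed", "successful"), ]

# Flag rows whose recorded state contradicts the pledged amount
reached_goal <- kept$pledged >= kept$goal
inconsistent <- (kept$state == "failed" & reached_goal) |
  (kept$state == "successful" & !reached_goal)

# Drop inconsistent rows and derive a 0/1 outcome
clean <- kept[!inconsistent, ]
clean$funded <- as.numeric(clean$state == "successful")
```

In the toy data, the fifth project is listed as failed although its pledges exceed its goal, so it is dropped along with the canceled, suspended, and purged projects.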

Of the original 120 variables, 41 contained meaningful information for analyses. Selecting from, transforming, and performing additional computations resulted in 29 variables used in subsequent exploratory analysis and machine learning. These variables are described in the Data Dictionary below.

2 Key Findings

2.1 percent_funded

The distribution of percent_funded is non-normal: most projects that pass roughly 75% of their goal wind up being successful. Outside intervention may contribute, such as Kickstarter promoting projects that are near their goals, or personal donations by the creators and/or the creators’ personal connections.


2.2 staff_pick

This flag denotes projects that have received the “Projects We Love” badge and are featured prominently on the website, in newsletters, and on blogs. Kickstarter staff clearly have a great eye for promising projects, some strong marketing impact, or both.

staff_pick has exciting implications for the continuation of this project. As machine learning applications become more widely used and increase in efficacy, the benchmark is often, “Can it do better than a human?” staff_pick serves as a proxy for the best that human judgment has to offer and therefore serves as a benchmark for our machine learning.

It is important to note that we do not anticipate being able to replicate this success rate as there is a boosting effect from the marketing efforts that should increase the success rate well beyond Day Zero probabilities.

Kickstarter - Projects We Love

## # A tibble: 2 x 3
##   staff_pick `n()` `mean(funded)`
##        <dbl> <int>          <dbl>
## 1          0 44081          0.524
## 2          1  5269          0.842
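A grouped summary like the one above (84.2% success for staff picks versus 52.4% otherwise) can be sketched in base R. The data frame below is a toy stand-in, and summary_tbl is a name introduced only for this illustration; the report's own code is not shown, so this is one plausible way to obtain the counts and success rates, not necessarily the method used.

```r
# Toy data: staff_pick flag and 0/1 funding outcome
df <- data.frame(
  staff_pick = c(0, 0, 0, 0, 1, 1),
  funded     = c(1, 0, 0, 1, 1, 1)
)

# Count and mean funding rate for each value of staff_pick
summary_tbl <- data.frame(
  staff_pick   = sort(unique(df$staff_pick)),
  n            = as.vector(table(df$staff_pick)),
  success_rate = as.vector(tapply(df$funded, df$staff_pick, mean))
)

summary_tbl
```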

3 Machine Learning Models

3.1 LASSO

library(glmnet)  # provides cv.glmnet()

# Shuffle the rows, then split 80/20 into training and test sets
set.seed(1)
n <- nrow(df_engr)
shuffled_df <- df_engr[sample(n), ]
train_indices <- 1:round(0.8 * n)
train <- shuffled_df[train_indices, ]
test_indices <- (round(0.8 * n) + 1):n
test <- shuffled_df[test_indices, ]
rm(shuffled_df, n, test_indices, train_indices)

train_y <- train$funded
train_x <- model.matrix(funded ~ 
  campaign_duration +
  usa +
  social_media_count +
  photo_key +
  video_status +
  mo_launched +
  category +
  goal_20 +
  description_length_10 +
  reward_length_10, data = train)


test_y <- test$funded
test_x <- model.matrix(funded ~ 
  campaign_duration +
  usa +
  social_media_count +
  photo_key +
  video_status +
  mo_launched +
  category +
  goal_20 +
  description_length_10 +
  reward_length_10, data = test)

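# alpha = 1 selects the LASSO penalty; lambda is tuned by cross-validation.
# With the default gaussian family, this fits a linear probability model to the 0/1 outcome.
cvfit <- cv.glmnet(x = train_x, y = train_y, alpha = 1)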

coef(cvfit, s = "lambda.min")
## 71 x 1 sparse Matrix of class "dgCMatrix"
##                                                     1
## (Intercept)                               0.664914095
## (Intercept)                               .          
## campaign_duration                        -0.001999695
## usa                                       0.130496091
## social_media_count1                       0.007694106
## social_media_count2                      -0.018798148
## social_media_count3                      -0.012401064
## photo_key                                -0.325786082
## video_status                              0.114512423
## mo_launched02                             0.012042709
## mo_launched03                             0.009070564
## mo_launched04                            -0.005003108
## mo_launched05                            -0.032815830
## mo_launched06                            -0.020027388
## mo_launched07                            -0.042405213
## mo_launched08                            -0.044413291
## mo_launched09                            -0.020568204
## mo_launched10                            -0.019607878
## mo_launched11                            -0.022795763
## mo_launched12                            -0.030831188
## categorycomics                            0.404898773
## categorycrafts                            0.305221540
## categorydance                             0.296892593
## categorydesign                            0.291811031
## categoryfashion                           0.469312227
## categoryfilm&video                        0.175696086
## categoryfood                              0.470958744
## categorygames                            -0.032333145
## categoryjournalism                        0.518167715
## categorymusic                             0.086738376
## categoryphotography                       0.448134568
## categorypublishing                       -0.092499720
## categorytechnology                        0.050521804
## categorytheater                           0.434534961
## goal_20(500,750]                         -0.057788659
## goal_20(750,1e+03]                       -0.117671705
## goal_20(1e+03,1.5e+03]                   -0.131514952
## goal_20(1.5e+03,1.8e+03]                 -0.129787711
## goal_20(1.8e+03,2e+03]                   -0.179015169
## goal_20(2e+03,2.5e+03]                   -0.201020201
## goal_20(2.5e+03,3e+03]                   -0.205210875
## goal_20(3e+03,3.5e+03]                   -0.222480198
## goal_20(3.5e+03,4.5e+03]                 -0.263936072
## goal_20(4.5e+03,5e+03]                   -0.325905098
## goal_20(5e+03,5.2e+03]                   -0.392920567
## goal_20(5.2e+03,7e+03]                   -0.343285549
## goal_20(7e+03,8e+03]                     -0.349492016
## goal_20(8e+03,1e+04]                     -0.414197836
## goal_20(1e+04,1.2e+04]                   -0.433992904
## goal_20(1.2e+04,1.6e+04]                 -0.466483730
## goal_20(1.6e+04,2.5e+04]                 -0.525832603
## goal_20(2.5e+04,5e+04]                   -0.614624699
## goal_20(5e+04,2.15e+07]                  -0.752777020
## description_length_10(754,1.11e+03]       0.047569711
## description_length_10(1.11e+03,1.44e+03]  0.094708324
## description_length_10(1.44e+03,1.81e+03]  0.115029118
## description_length_10(1.81e+03,2.22e+03]  0.138664965
## description_length_10(2.22e+03,2.74e+03]  0.166152398
## description_length_10(2.74e+03,3.46e+03]  0.170785197
## description_length_10(3.46e+03,4.58e+03]  0.175105438
## description_length_10(4.58e+03,6.64e+03]  0.228388846
## description_length_10(6.64e+03,1.4e+05]   0.287361125
## reward_length_10(2.62e+03,3.71e+03]       0.063860414
## reward_length_10(3.71e+03,4.61e+03]       0.111644718
## reward_length_10(4.61e+03,5.49e+03]       0.139014324
## reward_length_10(5.49e+03,6.43e+03]       0.181298204
## reward_length_10(6.43e+03,7.52e+03]       0.198659617
## reward_length_10(7.52e+03,8.86e+03]       0.211436713
## reward_length_10(8.86e+03,1.08e+04]       0.259518926
## reward_length_10(1.08e+04,1.45e+04]       0.271572314
## reward_length_10(1.45e+04,1.37e+05]       0.351476224
mean(test_y)
## [1] 0.5593718
mean(test_y == as.numeric(predict(cvfit, s = "lambda.min", test_x, type = "response") >= .5))
## [1] 0.7014184
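The two numbers above are the share of funded projects in the test set (about 55.9%) and the accuracy of the thresholded LASSO predictions (about 70.1%). The comparison can be sketched on toy vectors as follows; actual and score are made-up values, and using the majority-class share as the baseline is an assumption about how the two figures are meant to be read.

```r
# Toy outcomes and predicted scores (illustrative values only)
actual <- c(1, 0, 1, 1, 0, 1, 0, 1)
score  <- c(0.8, 0.4, 0.6, 0.3, 0.2, 0.9, 0.7, 0.55)

# Classify as funded when the predicted score is at least 0.5
predicted <- as.numeric(score >= 0.5)

# Accuracy of the model versus always guessing the majority class
accuracy <- mean(predicted == actual)
baseline <- max(mean(actual), 1 - mean(actual))

# Cross-tabulate actual against predicted outcomes
confusion <- table(actual = actual, predicted = predicted)
```

A confusion table shows whether the errors are mostly false positives or false negatives, which a single accuracy figure hides.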

3.2 Decision Tree and/or Random Forest

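library(rpart)       # rpart(), prune(), printcp(), plotcp()
library(rpart.plot)  # prp()

# funded is numeric 0/1, so rpart fits a regression tree (method = "anova")
set.seed(1)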
tree <- rpart(funded ~ 
  campaign_duration +
  usa +
  social_media_count +
  photo_key +
  video_status +
  mo_launched +
  category +
  goal +
  description_length +
  reward_length, data = train)

summary(tree)
## Call:
## rpart(formula = funded ~ campaign_duration + usa + social_media_count + 
##     photo_key + video_status + mo_launched + category + goal + 
##     description_length + reward_length, data = train)
##   n= 39480 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.03433136      0 1.0000000 1.0000565 0.001159178
## 2 0.03397122      1 0.9656686 0.9596011 0.002306328
## 3 0.01522393      2 0.9316974 0.9327278 0.002656245
## 4 0.01486580      3 0.9164735 0.9189171 0.002906249
## 5 0.01000000      4 0.9016077 0.9027850 0.003075436
## 
## Variable importance
##      reward_length               goal           category 
##                 42                 25                 16 
## description_length       video_status  campaign_duration 
##                 10                  3                  3 
## 
## Node number 1: 39480 observations,    complexity param=0.03433136
##   mean=0.5571935, MSE=0.2467289 
##   left son=2 (19858 obs) right son=3 (19622 obs)
##   Primary splits:
##       goal               < 4320.84  to the right, improve=0.03433136, (0 missing)
##       category           splits as  LRRRRRRRLRRRLLR, improve=0.03196349, (0 missing)
##       reward_length      < 4056.5   to the left,  improve=0.02676679, (0 missing)
##       video_status       < 0.5      to the left,  improve=0.02034017, (0 missing)
##       description_length < 1119.5   to the left,  improve=0.01583421, (0 missing)
##   Surrogate splits:
##       reward_length      < 7034.5   to the right, agree=0.639, adj=0.273, (0 split)
##       description_length < 2632.5   to the right, agree=0.628, adj=0.251, (0 split)
##       category           splits as  RLRRLLLLLLRLRLR, agree=0.596, adj=0.187, (0 split)
##       campaign_duration  < 29.995   to the right, agree=0.569, adj=0.133, (0 split)
##       video_status       < 0.5      to the right, agree=0.560, adj=0.115, (0 split)
## 
## Node number 2: 19858 observations,    complexity param=0.03397122
##   mean=0.4657065, MSE=0.248824 
##   left son=4 (4221 obs) right son=5 (15637 obs)
##   Primary splits:
##       reward_length      < 4814.5   to the left,  improve=0.06697005, (0 missing)
##       description_length < 1925.5   to the left,  improve=0.04453456, (0 missing)
##       category           splits as  LRRRRRRRLRRRLLR, improve=0.04166823, (0 missing)
##       video_status       < 0.5      to the left,  improve=0.03845977, (0 missing)
##       goal               < 18664.1  to the right, improve=0.02109630, (0 missing)
##   Surrogate splits:
##       description_length < 894.5    to the left,  agree=0.803, adj=0.073, (0 split)
##       video_status       < 0.5      to the left,  agree=0.788, adj=0.004, (0 split)
##       goal               < 2500000  to the right, agree=0.788, adj=0.001, (0 split)
##       campaign_duration  < 7.49     to the left,  agree=0.788, adj=0.001, (0 split)
## 
## Node number 3: 19622 observations,    complexity param=0.0148658
##   mean=0.6497809, MSE=0.2275657 
##   left son=6 (6380 obs) right son=7 (13242 obs)
##   Primary splits:
##       reward_length      < 4056.5   to the left,  improve=0.03242913, (0 missing)
##       video_status       < 0.5      to the left,  improve=0.02657808, (0 missing)
##       category           splits as  LRRRRRRRLRLRLLR, improve=0.02453957, (0 missing)
##       description_length < 1066.5   to the left,  improve=0.01988858, (0 missing)
##       campaign_duration  < 29.785   to the right, improve=0.01356609, (0 missing)
##   Surrogate splits:
##       description_length < 787.5    to the left,  agree=0.697, adj=0.068, (0 split)
##       video_status       < 0.5      to the left,  agree=0.684, adj=0.029, (0 split)
##       goal               < 331.5    to the left,  agree=0.679, adj=0.011, (0 split)
##       campaign_duration  < 7.745    to the left,  agree=0.675, adj=0.002, (0 split)
## 
## Node number 4: 4221 observations
##   mean=0.2172471, MSE=0.1700508 
## 
## Node number 5: 15637 observations,    complexity param=0.01522393
##   mean=0.5327748, MSE=0.2489258 
##   left son=10 (6497 obs) right son=11 (9140 obs)
##   Primary splits:
##       category           splits as  LRRRRRRRLRRRLLR, improve=0.03809787, (0 missing)
##       goal               < 15525    to the right, improve=0.03077773, (0 missing)
##       reward_length      < 9565     to the left,  improve=0.01700483, (0 missing)
##       video_status       < 0.5      to the left,  improve=0.01567898, (0 missing)
##       description_length < 1981.5   to the left,  improve=0.01366007, (0 missing)
##   Surrogate splits:
##       description_length < 5760     to the right, agree=0.637, adj=0.126, (0 split)
##       usa                < 0.5      to the left,  agree=0.592, adj=0.017, (0 split)
##       goal               < 110555.5 to the right, agree=0.591, adj=0.015, (0 split)
##       video_status       < 0.5      to the left,  agree=0.590, adj=0.013, (0 split)
## 
## Node number 6: 6380 observations
##   mean=0.5260188, MSE=0.249323 
## 
## Node number 7: 13242 observations
##   mean=0.7094095, MSE=0.2061477 
## 
## Node number 10: 6497 observations
##   mean=0.4172695, MSE=0.2431557 
## 
## Node number 11: 9140 observations
##   mean=0.6148796, MSE=0.2368027
prp(tree, extra = 1, box.palette = "auto")

printcp(tree)
## 
## Regression tree:
## rpart(formula = funded ~ campaign_duration + usa + social_media_count + 
##     photo_key + video_status + mo_launched + category + goal + 
##     description_length + reward_length, data = train)
## 
## Variables actually used in tree construction:
## [1] category      goal          reward_length
## 
## Root node error: 9740.9/39480 = 0.24673
## 
## n= 39480 
## 
##         CP nsplit rel error  xerror      xstd
## 1 0.034331      0   1.00000 1.00006 0.0011592
## 2 0.033971      1   0.96567 0.95960 0.0023063
## 3 0.015224      2   0.93170 0.93273 0.0026562
## 4 0.014866      3   0.91647 0.91892 0.0029062
## 5 0.010000      4   0.90161 0.90279 0.0030754
plotcp(tree)

# Prune at the complexity parameter with the lowest cross-validated error
index <- which.min(tree$cptable[ , "xerror"])
tree_min <- tree$cptable[index, "CP"]

pruned_tree <- prune(tree, cp = tree_min)
prp(pruned_tree, extra = 1, box.palette = "auto")

mean(test$funded)
## [1] 0.5593718
mean(test$funded == as.numeric(predict(pruned_tree, newdata = test) >= .5))
## [1] 0.6521783

3.3 Classification Model

Our final task in text analysis was to propose a mechanism for predictive binary classification of project success or failure based on project description.

To prepare to fit the model, we considered both unigrams (single terms) and bigrams and kept only terms that appear in at least 500 project descriptions. We calculated tf-idf and formatted the results in a document-term matrix. We also created train and test datasets based on an 80/20 split of the data.
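The term selection described above can be sketched in base R. This is a minimal illustration on three made-up descriptions; the helper names tokenize and bigrams are hypothetical, and the threshold is lowered from 500 to 2 so that the toy corpus keeps some terms.

```r
docs <- c(
  "help fund our board game",
  "a new board game expansion",
  "help us record our album"
)

# Split a description into lowercase word tokens
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[words != ""]
}

# Adjacent word pairs joined with a space
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Unique unigrams and bigrams found in each description
terms_per_doc <- lapply(docs, function(d) {
  w <- tokenize(d)
  unique(c(w, bigrams(w)))
})

# Document frequency: number of descriptions containing each term
doc_freq <- table(unlist(terms_per_doc))

min_docs <- 2  # the report used 500; 2 suits this toy corpus
vocab <- names(doc_freq[doc_freq >= min_docs])
```

Only the terms in vocab would become columns of the document-term matrix.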

We tested Naïve Bayes and Random Forest classification models.

4 Text Analysis

The variable full_description contains the complete project description from Kickstarter. Unstructured data, such as this text variable, require more cleaning and transformation to be useful, but can potentially be a source of rich information. Our application of text analysis had three primary motives:
1. Examine word frequency with word counts and wordclouds
2. Construct topic models
3. Perform binary classification to predict project funding status

4.1 Examine Word Frequency

We began by transforming the strings of text in full_description into a data frame with one word per row. We then removed English stop words, common words that carry little semantic meaning and are thus immaterial to analyses (e.g., “and”, “the”, “of”). Finally, we determined word counts for:
* the entire dataset,
* only successful projects, and
* only failed projects.
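The steps above can be sketched in base R. The two descriptions and the four-word stop list are made up for illustration, and the original analysis may well have used a text mining package instead of the manual splitting shown here.

```r
descriptions <- data.frame(
  funded = c(1, 0),
  full_description = c("Help us fund the album and the tour",
                       "The game of the year"),
  stringsAsFactors = FALSE
)
stop_words <- c("and", "the", "of", "us")  # tiny illustrative list

# One row per word, carrying along the funded flag
word_list <- strsplit(tolower(descriptions$full_description), "[^a-z]+")
tidy_words <- data.frame(
  funded = rep(descriptions$funded, lengths(word_list)),
  word   = unlist(word_list),
  stringsAsFactors = FALSE
)

# Remove empty tokens and stop words
tidy_words <- tidy_words[tidy_words$word != "" &
                           !(tidy_words$word %in% stop_words), ]

# Word counts overall, for funded projects, and for failed projects
all_counts    <- sort(table(tidy_words$word), decreasing = TRUE)
funded_counts <- table(tidy_words$word[tidy_words$funded == 1])
failed_counts <- table(tidy_words$word[tidy_words$funded == 0])
```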

The graph below shows the words that occur over 25,000 times in the full description text of all projects.

Next, we examined the correlation between word proportions of successful and failed project descriptions. Word proportion represents the percentage of time that a given word is used out of the total number of words in the document. In this case, the documents are the collection of all successful project descriptions and all failed project descriptions. We observed, both visually and in terms of Pearson’s correlation coefficient, that the terms used in successful and failed project descriptions were overwhelmingly similar.

## 
##  Pearson's product-moment correlation
## 
## data:  proportion and failed
## t = 1167.9, df = 111150, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.961139 0.962025
## sample estimates:
##       cor 
## 0.9615845
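A test like the one above can be reproduced in miniature with cor.test from base R. The word counts below are invented for illustration; only the procedure (convert counts to within-group proportions, then correlate) follows the description in the text.

```r
# Made-up word counts for the two groups of descriptions
word_counts <- data.frame(
  word       = c("project", "film", "music", "game", "book", "dice"),
  successful = c(900, 400, 350, 200, 80, 30),
  failed     = c(950, 380, 300, 220, 150, 5)
)

# Convert raw counts to proportions within each group
word_counts$prop_successful <- word_counts$successful / sum(word_counts$successful)
word_counts$prop_failed     <- word_counts$failed / sum(word_counts$failed)

# Pearson correlation between the two sets of proportions
result <- cor.test(word_counts$prop_successful, word_counts$prop_failed)
result$estimate
```

A coefficient near 1 means the two groups use words in nearly the same proportions, which is why raw frequency alone separates them poorly.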

Another way to visualize word frequency is by constructing wordclouds, which scale the size of text of a word to match its frequency in the document relative to other words’ frequencies. We constructed a wordcloud for the descriptions from the entire dataset. We were not surprised to see that “project”, “kickstarter”, and “goal” were among the most frequent terms used.

Wordclouds can be a useful way to visually observe differences in word variety and frequency between different groups of documents. Although they cannot be used in subsequent modeling, they are a tool for understanding unstructured text data and formulating hypotheses.

Therefore, we grouped our dataset into documents:
* by year to identify trends over time, and
* by funded to identify differences between successful and failed projects.

Prior to generating the wordclouds, we also created a custom set of stop words to weed out common terms in our dataset that could mask points of distinction between documents.

In the wordclouds by year, we see that music was the most prevalent category in 2009, but film began to emerge as the predominant category in 2010-2011. In 2012-2013, games appear as the biggest category. These wordclouds also give us a hint regarding the variety of projects. From 2009-2011, the wordclouds become larger and word frequency is less concentrated around the same terms. Abruptly in 2012, the projects seem to become less disparate, but in 2013 variety increases again. This suggests that the degree of project variety on Kickstarter may be cyclical; this seems logical as artists and entrepreneurs in the same field turn to Kickstarter after hearing about colleagues’ successes. However, more years of data are needed to verify the hypothesis of three-year periodicity.

## [1] 2009

## [1] 2010

## [1] 2011

## [1] 2012

## [1] 2013

In the wordclouds by funding status, we observed a high degree of similarity in both terms and frequency between successful and failed projects. The higher prevalence of “book” in the failed wordcloud suggests that books were more likely to fail. There also seemed to be more variety in the successful wordcloud, perhaps indicating richer project descriptions. However, it seems that high word frequency may not be the best delineator of successful versus failed projects.

## [1] "failed"

## [1] "successful"

Sometimes the best way to distinguish two similar documents is to look at the terms that are unique to each document, rather than the most frequent terms. For example, two books written by the same author would likely generate similar wordclouds, yet the unique characters and places in the books would enable us to detect which book is which.

To see if this might be the case in our collection of successful and failed projects, we examined the term frequency-inverse document frequency (tf-idf). tf looks for terms that are common; idf decreases the weight placed on commonly used terms in the collection and increases the weight placed on words that are not commonly used in the collection (i.e., common in a few documents). To remove nonsensical words from the analysis, we only considered words with a frequency of greater than 500, which is a reasonably low cutoff in a dataset with 700,000+ unique terms.
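As a worked illustration, tf-idf for a two-document collection can be computed by hand in base R. The counts are invented, and the formulas (tf as a term's share of its document's words, idf as the natural log of the number of documents over the number of documents containing the term) are one common definition, assumed here and not confirmed as the exact one used in the analysis.

```r
# Made-up term counts for the two collections of descriptions
counts <- data.frame(
  document = c("successful", "successful", "successful",
               "failed", "failed", "failed"),
  term     = c("project", "dice", "film", "project", "server", "film"),
  n        = c(50, 10, 40, 60, 5, 35),
  stringsAsFactors = FALSE
)

# tf: share of a document's words accounted for by the term
totals <- tapply(counts$n, counts$document, sum)
counts$tf <- counts$n / as.vector(totals[counts$document])

# idf: log of (number of documents / documents containing the term)
n_docs <- length(unique(counts$document))
docs_with_term <- table(counts$term)
counts$idf <- log(n_docs / as.vector(docs_with_term[counts$term]))

counts$tf_idf <- counts$tf * counts$idf
```

Terms that appear in both collections ("project", "film") receive an idf of zero, so only terms unique to one collection ("dice", "server") stand out.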

The results of this analysis suggest that board games and film are likely to be successful (dice, unlocked, filmmaker(s), expansion, boards, filmmaking, premiere). However, although the games category overall had a high success rate, it appears that games involving war and violence were less likely to receive funding (weapon, battles, security, agent), as were online games (multiplayer, server, playable, modes, animations).

4.2 Topic Modeling

The analyses in the previous section have focused on the “bag-of-words” approach and word frequency as a method for natural language processing, the means by which computers make sense of human language. Although this is a common and useful approach, there are other useful ways to describe text data.

One such method is topic modeling. Topic models assume that words or groups of words (called n-grams) which appear frequently together in a dataset are explained by underlying, unobserved groups (called topics). By examining word or n-gram overlap in the documents comprising a dataset, these topics can be detected. Although the computer cannot provide a semantic label for the topics, a human who is familiar with the dataset could examine the top words and determine a theme.

4.3 Latent Dirichlet allocation

We chose Latent Dirichlet allocation (LDA) as our statistical model for topic detection. LDA examines text by word frequency and co-occurrence in documents, which are individual project descriptions in our case. LDA assumes that each document covers a small number of topics and that each topic uses a small set of words frequently, and so it is good at assigning documents to topics.

To feed data into the model, we first processed the text to transform it to lowercase, remove punctuation, and remove stop words. In this section, we also performed word stemming, which groups together words that have the same root but different suffixes. This process helps ensure that words with the same semantic meaning, but different verb conjugations and the like, are assessed as the same word. Consequently, our output shows some incomplete word stems.
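The preprocessing steps can be sketched as a small base R function. The function name preprocess is hypothetical, and the suffix-stripping rule is only a crude stand-in for a real stemming algorithm: it will not reproduce stems such as “piec” or “galleri” that appear in the model output.

```r
# Lowercase, strip punctuation, drop stop words, then crudely stem
preprocess <- function(text, stop_words) {
  text  <- tolower(text)
  text  <- gsub("[[:punct:]]", " ", text)
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[words != "" & !(words %in% stop_words)]
  # Crude illustrative stemmer: remove a trailing "ing", "ed", or "s"
  sub("(ing|ed|s)$", "", words)
}

stems <- preprocess(
  "The artist painted; she paints, and is painting!",
  stop_words = c("the", "she", "and", "is")
)
stems
```

After stemming, “painted”, “paints”, and “painting” all collapse to the same token and are counted as one word.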

After processing the text, we used it to generate the documents, a vocabulary of terms in the dataset, and the metadata needed to construct the model. Consistent with our tf-idf analysis above, we only considered terms that appeared in at least 500 documents. We ran iterations of the LDA model specifying both 20 and 40 topics. The model did not reach convergence over 10 or 20 iterations; however, meaningful topics emerged with 20 iterations over 40 topics.

Visualizing the results of our topic model, we see some meaningful topics emerge, some centered on the mechanisms of the platform, and others identifying product categories or subcategories.

For example, Topic 1 could be labeled Funding Requests and includes terms like “help”, “money”, “donate”, “dollar”, “goal”, “buck”, and “reach”. Topic 7 is all about the rewards provided to backers if the project is successful: “pledge”, “reward”, “level”, “backer”, “goal”, and “ship.”

## Topic 1 Top Words:
##       Highest Prob: art, artist, work, paint, piec, will, creat 
##       FREX: paint, art, exhibit, galleri, artist, portrait, piec 
##       Lift: exhibit, paint, galleri, canva, portrait, painter, curat 
##       Score: art, paint, artist, exhibit, galleri, piec, work 
## Topic 7 Top Words:
##       Highest Prob: tour, travel, show, new, citi, road, will 
##       FREX: san, tour, road, trip, van, travel, francisco 
##       Lift: diego, las, detroit, van, coast, san, seattl 
##       Score: tour, travel, citi, los, san, show, road

On the other hand, Topic 6 seems to describe a certain subcategory of Film & Video and could be labeled Web Series with terms like “show”, “series”, “episode”, “season”, “pilot”, “web”, “anime”, and “comedy.” Topic 24 seems to describe a subcategory of Journalism and could be labeled Periodicals with terms like “issue”, “magazine”, “media”, “interview”, “article”, “content”, “journal”, and “print.”

## Topic 6 Top Words:
##       Highest Prob: school, learn, student, children, kid, educ, program 
##       FREX: student, school, children, educ, teach, kid, teacher 
##       Lift: teacher, classroom, teach, student, educ, children, school 
##       Score: school, student, children, educ, kid, learn, teach 
## Topic 24 Top Words:
##       Highest Prob: use, can, develop, app, power, devic, control 
##       FREX: app, devic, softwar, user, code, applic, iphon 
##       Lift: usb, batteri, devic, app, hardwar, tablet, softwar 
##       Score: app, devic, user, softwar, iphon, usb, develop

The theme of the projects is clear from some topics, although the type of project is not easily distinguished. For example, Topic 30 is about Fantasy, but could span several types of projects. The same is true of Topic 31 (Christianity), Topic 35 (Family), and Topic 39 (Outer Space).

## Topic 30 Top Words:
##       Highest Prob: team, compani, market, busi, product, industri, design 
##       FREX: busi, compani, market, industri, sport, team, sale 
##       Lift: sport, busi, hors, inc, compani, len, consult 
##       Score: team, compani, market, busi, product, industri, design 
## Topic 31 Top Words:
##       Highest Prob: music, play, musician, song, band, sound, guitar 
##       FREX: guitar, jazz, music, sing, musician, piano, instrument 
##       Lift: guitarist, jazz, piano, bass, melodi, guitar, sing 
##       Score: music, song, musician, guitar, band, play, jazz 
## Topic 35 Top Words:
##       Highest Prob: project, will, fund, work, creat, complet, anim 
##       FREX: anim, fund, project, complet, profession, hire, necessari 
##       Lift: cartoon, anim, pet, fruition, portion, exposur, contract 
##       Score: project, anim, fund, creat, will, complet, product 
## Topic 39 Top Words:
##       Highest Prob: card, stretch, goal, set, add, pledg, will 
##       FREX: card, stretch, deck, pack, add, box, pdf 
##       Lift: deck, miniatur, card, dice, stretch, pdf, pack 
##       Score: card, deck, stretch, pledg, dice, pdf, add

We also visualized the correlations between the 40 topics. The green nodes indicate topics, and the dashed lines represent relatedness between topics. The length of a dashed line indicates the degree of similarity between two topics. Our topics are highly related to one another, both in terms of the number of connections and the distance of connections.

In natural language processing, data often arrive with little metadata to categorize the text. Although we have project category in our dataset, we have no mechanism, aside from text mining, to determine project themes, which may be highly related to success or failure. Therefore, the results of the LDA model could be useful for classification of successful and unsuccessful projects.

4.4 Acknowledgements

The following resources were invaluable to the completion of this section:
* Text Mining with R: A Tidy Approach (Silge & Robinson, 2018; https://www.tidytextmining.com)
* Class notes from Prof. Ujjal Mukherjee (University of Illinois at Urbana-Champaign, Gies College of Business)
* stm: R Package for Structural Topic Models (Roberts, Stewart, & Tingley; https://cran.r-project.org/web/packages/stm/vignettes/stmVignette.pdf)
* Binary text classification with Tidytext and caret (Hvitfeldt, 2018; https://www.hvitfeldt.me/2018/03/binary-text-classification-with-tidytext-and-caret/)
* naivebayes package documentation (ftp://cran.r-project.org/pub/R/web/packages/naivebayes/naivebayes.pdf)

5 Exploratory Analysis

5.1 Ex Post Facto

Ex post facto variables are those generated after the start of the project. They are interesting to examine and can provide valuable insight into Kickstarter; however, they are not appropriate for our predictive models because they are pseudo-outcome variables. Some, such as comments, can give a project creator direction on what to do mid-project to increase their chance of funding.

5.1.1 backers_count

backers_count is a powerful predictor of project funding. We can see that the distribution resembles a logistic function. We found it surprising that even at the top 5% of backers_count there are still projects that are not funded. We hypothesize that these projects have an extremely large goal.


5.1.2 comments_count

The vast majority of projects received fewer than 20 comments. The chance of funding increases substantially as the number of comments increases. Most notably, receiving as few as two comments is associated with a chance of funding that is higher by 30% or more. We hypothesize that the causality runs from project quality to comments: a good project attracts more comments. Even so, the correlation is strong enough to advise any creator to make a concerted effort to start a conversation in the comments section of their project.


5.1.3 updates_count

Another solid indicator of funding is updates_count. Just as with the other ex post facto variables, the causality is likely reversed, as creators are probably more willing to update a project that is getting traction. Given the continued improvements throughout the deciles, it is surely worth regularly providing updates for your project to finish off the funding, or possibly move well past the 100% funded mark.


5.1.4 spotlight

We were surprised to find a variable with 100% predictive power occurring in over 20,000 projects. We dug deeper and found that spotlight denotes projects to be featured on Kickstarter’s recently funded page! Kickstarter Spotlight

## # A tibble: 2 x 3
##   spotlight `n()` `mean(funded)`
##       <dbl> <int>          <dbl>
## 1         0 21831              0
## 2         1 27519              1

5.2 Day Zero

Day Zero variables are any which can be observed and/or controlled at the start of the project. These are the most important for our predictive models as they allow us to predict a project’s funding before any Kickstarter activity.

5.2.1 goal

One of the most obvious, and ultimately most significant, predictors is goal. Funding success shows a clear downward trend as the goal amount increases. While this is intuitive, it is worth noting that the relationship does not appear linear. For this reason, we used the quantile function to bin goal into quantile-based groups.
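Quantile-based binning of this kind can be sketched in base R with quantile and cut. The goal values below are simulated, and the choice of 20 bins is inferred from the variable name goal_20 used in the models; the exact call used in the analysis is not shown, so treat this as an assumption.

```r
# Simulated, right-skewed goal amounts (illustrative only)
set.seed(1)
goal <- round(rlnorm(1000, meanlog = 8.5, sdlog = 1.2))

# Break points at every 5th percentile; unique() guards against ties
breaks <- unique(quantile(goal, probs = seq(0, 1, length.out = 21)))

# Assign each goal to its bin; include.lowest keeps the minimum value
goal_20 <- cut(goal, breaks = breaks, include.lowest = TRUE)

table(goal_20)
```

Treating the bins as a factor lets a linear model estimate a separate effect for each goal range, which accommodates the non-linear relationship.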

5.2.2 category

Some categories never fail in this dataset (only considered if n > 50):
* design/product design (1,098 projects)
* film & video/documentary (2,202 projects)
* film & video/shorts (3,513 projects)
* games/tabletop games (1,064 projects)

Most successful parent categories (only considered if n > 100):
* design (category 7): 1,725 projects, 81.2% successful
* film & video (category 11): 12,087 projects, 65.2% successful
* music (category 14): 14,635 projects, 59.2% successful

Least successful parent categories (only considered if n > 100):
* publishing (category 18): 8,725 projects, 39.7% successful
* technology (category 16): 1,638 projects, 46.2% successful
* games (category 12): 4,125 projects, 46.9% successful

## # A tibble: 15 x 10
##    category    count success_rate avg_goal med_perc_funded avg_perc_funded
##    <fct>       <int>        <dbl>    <dbl>           <dbl>           <dbl>
##  1 music       14635         59.2    7589.           102.             117.
##  2 film&video  12087         65.2   22409.           102.             301.
##  3 publishing   8725         39.7    7951.            19.8            336.
##  4 art          6122         51.2    9139.           100              260.
##  5 games        4125         46.9   38009.            41.2           1214.
##  6 design       1725         81.2   12981.           130              487.
##  7 technology   1638         46.2   68119.            47.0            195.
##  8 crafts         61         96.7    3235.           111.             146.
##  9 comics         57        100     11760.           176.             364.
## 10 theater        57        100      5994.           113.             127.
## 11 food           45        100     19737.           116.             147.
## 12 fashion        36        100     18178.           136.             580.
## 13 journalism     15        100     23405            111.             139.
## 14 dance          11        100      3165.           120.             120.
## 15 photography    11        100      9061.           166.             218.
## # ... with 4 more variables: avg_backer_count <dbl>,
## #   med_backer_count <dbl>, avg_contribution <dbl>, med_contribution <dbl>
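
The minimum-size filter used in the lists above can be sketched as follows. This is a base R illustration on made-up data, with a hypothetical threshold `min_n`. The report itself uses thresholds of 50 and 100 on the full dataset, and the code that produced the table above is not shown.

```r
# Success rate per category, keeping only categories above a minimum size.
# Toy data (made up); column names follow the report's conventions.
toy <- data.frame(
  category = c("music", "music", "music", "dance",
               "games", "games", "games", "games"),
  funded   = c(1, 0, 1, 1, 0, 0, 1, 0)
)

min_n <- 2  # hypothetical threshold for the toy data

counts <- tapply(toy$funded, toy$category, length)
rates  <- tapply(toy$funded, toy$category, mean)

by_cat <- data.frame(
  category     = names(counts),
  count        = as.vector(counts),
  success_rate = 100 * as.vector(rates)
)

# Drop sparse categories, then rank the rest by success rate
by_cat <- by_cat[by_cat$count > min_n, ]
by_cat <- by_cat[order(-by_cat$success_rate), ]
by_cat
```

Filtering before ranking keeps tiny categories, whose rates are dominated by noise, from topping the list.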

5.2.2.1 Frequency by category

The distribution of projects by category shows Kickstarter has an intense focus on creative projects. We hypothesize that the minimal appearance of some categories suggests that Kickstarter’s classification system tends to favor large, general groupings. It may also be arbitrary in some instances, as many dance and photography projects could readily be placed in art.

5.2.2.2 Funding by category

The chance of funding does not follow the same pattern as the category frequency distribution. In fact, the sparsely populated categories have near-perfect funding rates. Further investigation into these anomalies, for example by exploring their correlation with variables such as spotlight, may reveal a selection bias in how projects are assigned to obscure classifications.

5.2.3 launched_at

The number of projects increased exponentially from 2009 to 2012 and appears to have increased more gradually after 2012. We only collected data through December 2013 and anticipate continued growth in subsequent years.

We explored funding by mo_launched to see whether seasonality affects Kickstarter. The most dramatic dips occur in May and December. This is consistent with our understanding of financial markets in general: they slow down in early summer and have much lower volume around the holiday season.
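
A minimal sketch of this kind of seasonality check is shown below. It uses base R and made-up launch dates. The variable names follow the data dictionary, but the way mo_launched was actually derived is an assumption.

```r
# Derive launch month from a launch date and compare funding rates by month.
# Toy data (made up); `mo_launched` follows the data dictionary's "mm" format.
toy <- data.frame(
  date_launched = as.Date(c("2012-05-03", "2012-05-20", "2012-12-01",
                            "2013-03-15", "2013-03-28", "2013-12-10")),
  funded        = c(0, 1, 0, 1, 1, 0)
)
toy$mo_launched <- format(toy$date_launched, "%m")

# Mean of a 0/1 outcome within each month is that month's funding rate
funding_by_month <- tapply(toy$funded, toy$mo_launched, mean)
funding_by_month
```

Pooling all years by calendar month, as done here, isolates the seasonal pattern from the overall growth trend.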

5.2.4 country

The majority of projects are based in the United States. Domestic projects have a success rate about 7 percentage points higher than international projects (56.1% versus 49.0%). We believe two factors drive this difference. First, Kickstarter is a U.S.-based company and will, therefore, better meet the needs of its domestic customers. Second, crowdfunding requires a critical mass of people to support a project ecosystem. As backers are most likely to fund projects in their own country, any new regional expansion will have lower success rates while the critical mass develops.

## # A tibble: 5 x 3
##   country count funded_rate
##   <fct>   <int>       <dbl>
## 1 US      47007       0.561
## 2 GB       1986       0.492
## 3 CA        278       0.464
## 4 AU         54       0.556
## 5 NZ         25       0.52
## 
## Call:
## lm(formula = funded ~ usa, data = df_country)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5610 -0.5610  0.4390  0.4390  0.5096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.49040    0.01026  47.814  < 2e-16 ***
## usa          0.07058    0.01051   6.717 1.88e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 49348 degrees of freedom
## Multiple R-squared:  0.0009133,  Adjusted R-squared:  0.0008931 
## F-statistic: 45.11 on 1 and 49348 DF,  p-value: 1.88e-11
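
When the only predictor is a 0/1 indicator, the regression above reduces to a comparison of group means: the intercept (0.490) is the funding rate outside the US, and the usa coefficient (0.071) is the gap between the two rates. The sketch below demonstrates this identity on simulated data. The probabilities only roughly echo the report and are assumptions.

```r
# With a single 0/1 predictor, OLS recovers the two group means:
# intercept = mean of group 0, slope = difference between the group means.
# Simulated data; the rates roughly echo the report but are not the real data.
set.seed(42)
usa    <- rbinom(5000, 1, 0.95)
funded <- rbinom(5000, 1, ifelse(usa == 1, 0.56, 0.49))

fit         <- lm(funded ~ usa)
group_means <- tapply(funded, usa, mean)

intercept  <- as.numeric(coef(fit)["(Intercept)"])
slope      <- as.numeric(coef(fit)["usa"])
mean_other <- as.numeric(group_means["0"])
mean_usa   <- as.numeric(group_means["1"])

c(intercept = intercept, mean_other = mean_other)
c(slope = slope, difference = mean_usa - mean_other)
```

The very low R-squared in the output above is consistent with this reading: the country indicator shifts the average funding rate but explains little of the project-to-project variation.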

5.2.5 photo_key

Only 25 of the 49,350 projects lack a photo. Consequently, a t-test shows no significant difference between having and not having a photo. This is probably not because photos don’t matter, but rather because the no-photo sample is too small to give the test statistical power.

## # A tibble: 2 x 3
##   photo_key `n()` `mean(funded)`
##       <dbl> <int>          <dbl>
## 1         0    25          0.68 
## 2         1 49325          0.558
## 
##  Welch Two Sample t-test
## 
## data:  funded by photo_key
## t = 1.2854, df = 24.026, p-value = 0.2109
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07413234  0.31899802
## sample estimates:
## mean in group 0 mean in group 1 
##       0.6800000       0.5575672
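
To make the power problem concrete, the Welch statistic can be rebuilt from the summary numbers above. The sketch below uses only the group sizes and means printed in the output. For a 0/1 outcome the sample variance follows from the mean, so no raw data is needed.

```r
# Rebuild the Welch t-test from group sizes and means of a 0/1 outcome.
n0 <- 25;    p0 <- 0.68        # no photo
n1 <- 49325; p1 <- 0.5575672   # has photo

# Sample variance of a 0/1 variable, with the n / (n - 1) correction
v0 <- p0 * (1 - p0) * n0 / (n0 - 1)
v1 <- p1 * (1 - p1) * n1 / (n1 - 1)

# Standard error of the difference: dominated by the 25-project group
se     <- sqrt(v0 / n0 + v1 / n1)
t_stat <- (p0 - p1) / se

# Welch-Satterthwaite degrees of freedom and the 95% confidence interval
df_welch <- (v0 / n0 + v1 / n1)^2 /
  ((v0 / n0)^2 / (n0 - 1) + (v1 / n1)^2 / (n1 - 1))
ci <- (p0 - p1) + c(-1, 1) * qt(0.975, df_welch) * se

c(t = t_stat, df = df_welch, lower = ci[1], upper = ci[2])
```

Nearly all of the standard error comes from the `v0 / n0` term, which is why the interval spans roughly 39 percentage points even though the larger group has almost 50,000 projects.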

5.2.6 video_status

We hypothesized that video_status would be a powerful predictor, as it is a proxy for whether or not the project has a video. The t-test shows that video_status has a statistically significant association with the success of the project.

# Count and funding rate for projects without (0) and with (1) a video
df_engr %>%
  group_by(video_status) %>%
  summarise(n(), mean(funded)) %>%
  ungroup()
## # A tibble: 2 x 3
##   video_status `n()` `mean(funded)`
##          <dbl> <int>          <dbl>
## 1            0  8882          0.408
## 2            1 40468          0.591
t.test(funded ~ video_status, data = df_engr)
## 
##  Welch Two Sample t-test
## 
## data:  funded by video_status
## t = -31.728, df = 13074, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1940137 -0.1714361
## sample estimates:
## mean in group 0 mean in group 1 
##        0.407791        0.590516

5.3 Social media connectedness variables

Social media presence is associated with funding success. Facebook seems to have the strongest effect, and YouTube has a negative coefficient. Our hypothesis is that Facebook and Twitter may be used for promotion, while creators focusing on YouTube may over-rely on their product content. Yet the most successful creators have all three, which suggests that YouTube is effective when paired with a comprehensive social media campaign.

# Facebook: count and funding rate without (0) and with (1) a link
df_engr %>%
  group_by(facebook) %>%
  summarise(n(), mean(funded)) %>%
  ungroup()
## # A tibble: 2 x 3
##   facebook `n()` `mean(funded)`
##      <dbl> <int>          <dbl>
## 1        0 36654          0.541
## 2        1 12696          0.605
# Twitter: count and funding rate without (0) and with (1) a link
df_engr %>%
  group_by(twitter) %>%
  summarise(n(), mean(funded)) %>%
  ungroup()
## # A tibble: 2 x 3
##   twitter `n()` `mean(funded)`
##     <dbl> <int>          <dbl>
## 1       0 46741          0.554
## 2       1  2609          0.617
# YouTube: count and funding rate without (0) and with (1) a link
df_engr %>%
  group_by(youtube) %>%
  summarise(n(), mean(funded)) %>%
  ungroup()
## # A tibble: 2 x 3
##   youtube `n()` `mean(funded)`
##     <dbl> <int>          <dbl>
## 1       0 45570          0.560
## 2       1  3780          0.528
# Count and funding rate by number of social media links (0 to 3)
df_engr %>%
  group_by(social_media_count) %>%
  summarise(n(), mean(funded)) %>%
  ungroup()
## # A tibble: 4 x 3
##   social_media_count `n()` `mean(funded)`
##   <fct>              <int>          <dbl>
## 1 0                  34484          0.543
## 2 1                  11226          0.594
## 3 2                   3061          0.581
## 4 3                    579          0.615
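# Mean funding rate by number of social media links
# (`fun` replaces the deprecated `fun.y` argument in ggplot2 3.3.0 and later)
df_engr %>% 
  ggplot(aes(x = social_media_count, y = funded)) +
  stat_summary(geom = "bar", fun = "mean", fill = "#332288") +
  labs(title = "Funding By Social Media Count",
       x = "Social Media Count",
       y = "Chance of Funding")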
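
For reference, a count variable like social_media_count can be built from the three platform indicators as sketched below. The report does not show how the column was actually derived, so this is an assumption based on the data dictionary, run on made-up indicator values.

```r
# Count how many of the three platforms each creator linked to.
# Toy indicators (made up); in the report these are 0/1 columns in df_engr.
toy <- data.frame(
  facebook = c(1, 0, 1, 0),
  twitter  = c(0, 0, 1, 0),
  youtube  = c(1, 0, 1, 1)
)

# Stored as a factor so each count gets its own bar and its own model term
toy$social_media_count <- factor(
  rowSums(toy[, c("facebook", "twitter", "youtube")]),
  levels = 0:3
)

# Overall indicator: 1 if the creator provided at least one link
toy$social_media <- as.integer(toy$social_media_count != "0")
toy
```

Treating the count as a factor rather than a number lets the funding rate differ freely across 0, 1, 2, and 3 links, which matters here because the observed rates are not monotonic.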

5.3.1 campaign_duration

Interestingly, campaign duration has an inverse relationship to the likelihood of receiving funding; longer campaigns are associated with higher failure rates.

# Bin campaign duration into deciles; unique() guards against duplicate break points
df_camp_dur <- data.frame(funded = df_engr$funded, campaign_duration = df_engr$campaign_duration)
df_camp_dur$cd_10 <- cut(df_camp_dur$campaign_duration, breaks = unique(quantile(df_camp_dur$campaign_duration, seq(0, 1, by = .1))), include.lowest = TRUE)

# Plot the mean funding rate for each duration bin
df_camp_dur %>% 
  group_by(cd_10) %>%
  summarise(avg_funded = mean(funded)) %>%
  ggplot(aes(x = cd_10, y = avg_funded)) +
  geom_bar(stat = "identity", fill = "#332288") +
  labs(title = "Funding By Campaign Duration",
       x = "Campaign Duration (days)",
       y = "Chance of Funding")

rm(df_camp_dur)

5.4 $avg_contribution

The average contribution appears to follow a skewed, roughly normal distribution with a mean of about $72.71.

# Check for NA
anyNA(df_engr$avg_contribution)
## [1] TRUE
mean(df_engr$avg_contribution, na.rm = TRUE)
## [1] 72.70791
# Keep contributions below $1,500 so extreme values do not dominate the plots
df_filtered_by_avg_contribution <- df_engr %>%
  filter(avg_contribution < 1500)

# Box Plot
boxplot(df_filtered_by_avg_contribution$avg_contribution)

# Histogram
df_filtered_by_avg_contribution %>%
  ggplot() + 
  geom_histogram(aes(x = avg_contribution))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#Clean up
rm(df_filtered_by_avg_contribution)

5.5 $description_length

# Mean funding rate for each decile of description length
df_engr %>% 
  ggplot(aes(x = description_length_10, y = funded)) +
  stat_summary(geom = "bar", fun = "mean", fill = "#332288") +
  labs(
    title = "Funding By Description",
    x = "Length of Description", 
    y = "Chance of Funding") + 
  theme_minimal() +
  # theme() affects this plot only; theme_update() would change the global default
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

5.6 $rewards

# Mean funding rate for each decile of reward description length
df_engr %>% 
  ggplot(aes(x = reward_length_10, y = funded)) +
  stat_summary(geom = "bar", fun = "mean", fill = "#332288") +
  labs(
    title = "Funding By Rewards",
    x = "Length of Rewards", 
    y = "Chance of Funding") + 
  theme_minimal() +
  # theme() affects this plot only; theme_update() would change the global default
  theme(axis.text.x = element_text(angle = 60, hjust = 1))

5.7 Date variables

# On average, deadline differs from failed_at, successful_at, and state_changed_at by roughly one to two minutes
# The difference is not meaningful, so remove failed_at, successful_at, and state_changed_at
ticktock <- data.frame(db_cleaned$created_at, 
                       db_cleaned$deadline, 
                       db_cleaned$failed_at, 
                       db_cleaned$launched_at, 
                       db_cleaned$state_changed_at, 
                       db_cleaned$successful_at)

ticktock <- mutate(ticktock, 
                   deadline_failed = db_cleaned.deadline - db_cleaned.failed_at,
                   deadline_success = db_cleaned.deadline - db_cleaned.successful_at,
                   deadline_state = db_cleaned.deadline - db_cleaned.state_changed_at)

mean(ticktock$deadline_failed, na.rm = TRUE)
## Time difference of -52.16174 secs
mean(ticktock$deadline_success, na.rm = TRUE)
## Time difference of -112.5382 secs
mean(ticktock$deadline_state)
## Time difference of -85.82944 secs
rm(ticktock)
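
The same date arithmetic underlies the campaign_duration variable, which the data dictionary defines as the number of days between launch and deadline. A minimal sketch with made-up timestamps is shown below. The actual derivation is not shown in this report, so the approach here is an assumption.

```r
# Days between launch and deadline, computed from POSIXct timestamps.
# Toy timestamps (made up), stored in UTC.
launched_at <- as.POSIXct(c("2013-01-01 12:00:00", "2013-06-15 08:30:00"), tz = "UTC")
deadline    <- as.POSIXct(c("2013-01-31 12:00:00", "2013-07-30 20:30:00"), tz = "UTC")

# units = "days" avoids difftime's automatic choice of seconds, minutes, or hours
campaign_duration <- as.numeric(difftime(deadline, launched_at, units = "days"))
campaign_duration
```

Fixing the units explicitly matters because subtracting two timestamps directly lets R pick the unit, as seen in the output above where the differences are reported in seconds.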

5.8 Not Fully Explored

More granular location variables would require more cleaning and may produce regional insights. The following variables were not fully explored:
* location_name
* location_state
* location_type
* fx_Rate
* profile_blurb
* profile_state

5.9 Rejected

We examined these variables but did not find them to be predictive:
* project_id
* disable_communication

##$project_id
#A random identifier, cannot easily observe a pattern
range(db_cleaned$project_id)
## [1]      21109 2147466649

6 Data Dictionary

##    Name                   Type       Description                                              Values
##  1 funded                 factor     Amount pledged compared to goal by deadline              0: failed; 1: successful
##  2 comments_count         integer    Number of comments users post during campaign            0 - 393041
##  3 goal                   numeric    Goal set at beginning of campaign, in local currency     0.01 - 21474836
##  4 updates_count          integer    Number of times page was updated during campaign         0 - 301
##  5 backers_count          integer    Number of backers that contributed to the project        0 - 87142
##  6 full_description       character  Text description of the project                          N/A
##  7 campaign_duration      numeric    Days between launch and deadline                         1.5 - 91.96
##  8 avg_contribution       numeric    Mean amount pledged per backer, in local currency        1 - 9606
##  9 percent_funded         numeric    Percent of goal received (%)                             0 - 4153501
## 10 spotlight              factor     If successful, indicates if project is featured          0: no spotlight; 1: spotlight
## 11 staff_pick             factor     Staff selected to receive 'Projects We Love' badge       0: no badge; 1: 'Projects We Love' badge
## 12 usa                    factor     Indicates location in the US or in another country       0: other countries; 1: USA
## 13 social_media           factor     Indicates if creator provided any links to social media  0: no links to social media; 1: one or more links provided
## 14 facebook               factor     Indicates if creator linked to Facebook                  0: no Facebook link; 1: Facebook link provided
## 15 twitter                factor     Indicates if creator linked to Twitter                   0: no Twitter link; 1: Twitter link provided
## 16 youtube                factor     Indicates if creator linked to YouTube                   0: no YouTube link; 1: YouTube link provided
## 17 social_media_count     integer    Number of social media links provided by creator         0, 1, 2, 3
## 18 photo_key              factor     Indicates if the project page had a photo                0: no photo; 1: has photo
## 19 video_status           factor     Indicates if the project page had a video                0: no video; 1: has video
## 20 reward_length          integer    Number of characters in reward structure description     76 - 136827
## 21 description_length     integer    Number of characters in full project description         0 - 140229
## 22 date_launched          Date       Date of project launch (yyyy-mm-dd)                      2009-04-24 to 2013-12-18
## 23 mo_yr_launched         Date       Month and year of project launch (mm-yyyy)               01-2010 to 12-2013
## 24 yr_launched            Date       Year of project launch (yyyy)                            2009 - 2013
## 25 mo_launched            Date       Month of project launch (mm)                             01 - 12
## 26 goal_20                factor     Ventile assigned to goal                                 [0.01,500], (500,750], (750,1e+03], (1e+03,1.5e+03], (1.5e+03,1.8e+03], (1.8e+03,2e+03], (2e+03,2.5e+03], (2.5e+03,3e+03], (3e+03,3.5e+03], (3.5e+03,4.5e+03], (4.5e+03,5e+03], (5e+03,5.2e+03], (5.2e+03,7e+03], (7e+03,8e+03], (8e+03,1e+04], (1e+04,1.2e+04], (1.2e+04,1.6e+04], (1.6e+04,2.5e+04], (2.5e+04,5e+04], (5e+04,2.15e+07]
## 27 description_length_10  factor     Decile assigned to full description length               [0,754], (754,1.11e+03], (1.11e+03,1.44e+03], (1.44e+03,1.81e+03], (1.81e+03,2.22e+03], (2.22e+03,2.74e+03], (2.74e+03,3.46e+03], (3.46e+03,4.58e+03], (4.58e+03,6.64e+03], (6.64e+03,1.4e+05]
## 28 reward_length_10       factor     Decile assigned to reward description length             [76,2.62e+03], (2.62e+03,3.71e+03], (3.71e+03,4.61e+03], (4.61e+03,5.49e+03], (5.49e+03,6.43e+03], (6.43e+03,7.52e+03], (7.52e+03,8.86e+03], (8.86e+03,1.08e+04], (1.08e+04,1.45e+04], (1.45e+04,1.37e+05]
## 29 category               factor     One of 15 buckets categorizing project field             art, comics, dance, design, fashion, food, film&video, games, journalism, music, photography, technology, theater, publishing, crafts